Approximate time: 60 minutes
Over the past decade, RNA sequencing (RNA-seq) has become an indispensable tool for transcriptome-wide analysis of differential gene expression and differential splicing of mRNAs [1]. Correctly identifying which genes/transcripts change in expression between specific conditions is key to understanding the biological processes that are affected.
In this workshop, we will walk you through an end-to-end gene-level RNA-seq differential expression workflow using various R packages. We will start with reading in data obtained from Salmon, convert pseudocounts to counts, perform exploratory data analysis for quality assessment and to explore the relationship between samples, perform differential expression analysis, and visually explore the results prior to performing downstream functional analysis.
For this workshop, we will be using a publicly available RNA-Seq dataset that is part of a larger study described in Kenny PJ et al, Cell Rep 2014.
The RNA-Seq was performed on HEK293F cells that were transfected with either a MOV10 transgene, siRNA to knock down Mov10 expression, or non-specific (irrelevant) siRNA. This resulted in three conditions: Mov10 oe (overexpression), Mov10 kd (knockdown), and Irrelevant kd, respectively. The number of replicates is as shown below.
Using these data, we will evaluate transcriptional patterns associated with perturbation of MOV10 expression. Please note that the irrelevant siRNA will be treated as our control condition.
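As a minimal sketch of what "treated as our control condition" can look like in R, the condition labels can be stored as a factor whose first level is the control. The label strings and the variable name `sampletype` below are illustrative assumptions, and the vector holds one entry per condition rather than the real replicate structure.

```r
# Illustrative labels only: one entry per condition, NOT the real replicate counts
sampletype <- c("MOV10_overexpression", "MOV10_knockdown", "control")

# Listing "control" first makes it the reference (baseline) level of the factor
sampletype <- factor(
  sampletype,
  levels = c("control", "MOV10_knockdown", "MOV10_overexpression")
)

# Inspect the level order; the first level is the baseline
levels(sampletype)
```

With R's default treatment contrasts, the first factor level serves as the baseline, so ordering the levels this way means the other two conditions are each compared against the irrelevant siRNA samples.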
What is the purpose of these datasets? What does Mov10 do?
The authors are investigating interactions between various genes involved in Fragile X syndrome, a disease in which there is aberrant production of the FMRP protein.
FMRP is “most commonly found in the brain, is essential for normal cognitive development and female reproductive function. Mutations of this gene can lead to fragile X syndrome, mental retardation, premature ovarian failure, autism, Parkinson’s disease, developmental delays and other cognitive deficits.” (from Wikipedia)
MOV10 is a putative RNA helicase that is also associated with FMRP in the context of the microRNA pathway.
The hypothesis the paper is testing is that FMRP and MOV10 associate and regulate the translation of a subset of RNAs.
Our questions:
* What patterns of expression can we identify with the loss or gain of MOV10?
* Are there any genes shared between the two conditions?
For this dataset, raw sequence reads were obtained from the Sequence Read Archive (SRA). These reads were then processed using the RNA-seq workflow as detailed in the pre-reading for this workshop. All steps were performed on the command line (Linux/Unix), including a thorough quality control assessment. If you are interested, we have the MultiQC html report for this dataset linked here for you to peruse.
The output directories from the mapping/quantification step of the workflow (Salmon) contain the data that we will be using. These transcript abundance estimates, often referred to as ‘pseudocounts’, will be the starting point for our differential gene expression analysis.
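To make the shape of this input concrete, here is a small sketch of locating the `quant.sf` file inside each Salmon output directory using base R. The directory names, transcript IDs, and values are hypothetical stand-ins written to a temporary folder so the example runs on its own; they are not the workshop data.

```r
# Hypothetical layout: one Salmon output directory per sample, each with a quant.sf.
# A throwaway tree is built in tempdir() so this sketch is self-contained.
salmon_root <- file.path(tempdir(), "salmon_demo")
sample_dirs <- file.path(salmon_root, c("sampleA.salmon", "sampleB.salmon"))

for (d in sample_dirs) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
  # quant.sf is a tab-delimited table with these five columns (values are made up)
  quant <- data.frame(
    Name = c("tx1", "tx2"),
    Length = c(1500L, 900L),
    EffectiveLength = c(1320.5, 720.5),
    TPM = c(10.2, 3.4),
    NumReads = c(25.7, 4.3)
  )
  write.table(quant, file.path(d, "quant.sf"),
              sep = "\t", quote = FALSE, row.names = FALSE)
}

# Collect the path to every quant.sf under the root directory
quant_files <- list.files(salmon_root, pattern = "^quant\\.sf$",
                          recursive = TRUE, full.names = TRUE)

# Name each path after its parent directory so samples stay identifiable
names(quant_files) <- basename(dirname(quant_files))

# Peek at one file; NumReads holds the estimated (possibly fractional) read counts
first <- read.delim(quant_files[1])
head(first)
```

The `NumReads` column is where the fractional abundance estimates live, which is why a conversion step is needed before treating them as counts.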
Let’s get started by opening up RStudio and setting up a new project for this analysis.
1. Go to the File menu and select New Project.
2. In the New Project window, choose New Directory. Then, choose New Project. Name your new directory DEanalysis and then “Create the project as subdirectory of:” the Desktop (or location of your choice).